Data Type Constraints in Data Analyzer Jobs

The data quality feature for Databricks in the Lazsa Platform uses the PyDeequ framework. PyDeequ is a Python wrapper for Deequ, a library built on top of Apache Spark for defining "unit tests for data" that measure data quality in large datasets.

Using the data type constraint in a data analyzer job, you can analyze the data types of the values in a particular column of a dataset.

When you apply the data type constraint in the data analyzer stage, PyDeequ analyzes the data in each selected column and categorizes the values into different data types.
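Deequ's data type analysis assigns each value to an inferred type such as Boolean, Integral, Fractional, or String. The following is a rough, self-contained Python sketch of that kind of inference logic; it is an illustration only, not the PyDeequ API.

```python
def infer_type(value):
    """Illustrative type inference, loosely modeled on the categories
    Deequ reports (Boolean, Integral, Fractional, String, Unknown)."""
    if value is None:
        return "Unknown"          # nulls are not assigned a concrete type
    s = str(value).strip()
    if s.lower() in ("true", "false"):
        return "Boolean"
    try:
        int(s)
        return "Integral"         # parses as a whole number
    except ValueError:
        pass
    try:
        float(s)
        return "Fractional"       # parses as a decimal number
    except ValueError:
        return "String"           # everything else stays a string

# A mixed column is categorized value by value:
print([infer_type(v) for v in [25, "3.14", "true", "unknown", None]])
# → ['Integral', 'Fractional', 'Boolean', 'String', 'Unknown']
```

In real jobs, PyDeequ performs this categorization on Spark DataFrames at scale; the sketch above only shows the per-value idea.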

Histograms and Bins

In PyDeequ, histograms and bins are concepts related to analyzing data distributions and defining data quality constraints based on those distributions. A histogram is a statistical representation of the distribution of values in a column of a DataFrame, while bins are the intervals or categories into which those values are grouped.

Consider a scenario where you create a data analyzer job for Databricks and select the data type constraint. After running the job, you see additional entries in the output for the data type constraint. This is because PyDeequ uses histograms and bins to analyze the data in your columns and categorize it into different data types.

Example

Let's say you have a column named “Age” with values [25, 30, "twenty-five", 40, "unknown"]. When you run a data type constraint check using PyDeequ, it generates a histogram as follows:

  • Histogram.bins: ['Integral', 'String']

  • Histogram.abs.Integral: 3 (values: 25, 30, 40)

  • Histogram.ratio.Integral: 0.6 (60% of the total values)

  • Histogram.abs.String: 2 (values: "twenty-five", "unknown")

  • Histogram.ratio.String: 0.4 (40% of the total values)

This indicates that 60% of the values in the Age column are integers and 40% are strings.

The output reflects these additional entries generated by the data type constraint.
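The metric entries in the example above can be reproduced with a short Python sketch that counts values per inferred type and builds the `Histogram.abs.*` and `Histogram.ratio.*` entries. This illustrates the computation behind the output; it is not the PyDeequ API.

```python
from collections import Counter

def histogram_metrics(values):
    """Count values per simple inferred type and return
    Histogram.abs.* / Histogram.ratio.* entries, PyDeequ-style.
    Illustrative only: ints count as 'Integral', all else as 'String'."""
    def infer(v):
        return "Integral" if isinstance(v, int) else "String"

    counts = Counter(infer(v) for v in values)
    total = len(values)
    metrics = {"Histogram.bins": sorted(counts)}
    for dtype, n in counts.items():
        metrics[f"Histogram.abs.{dtype}"] = n        # absolute count per type
        metrics[f"Histogram.ratio.{dtype}"] = n / total  # share of total values
    return metrics

age = [25, 30, "twenty-five", 40, "unknown"]
print(histogram_metrics(age))
# → {'Histogram.bins': ['Integral', 'String'],
#    'Histogram.abs.Integral': 3, 'Histogram.ratio.Integral': 0.6,
#    'Histogram.abs.String': 2, 'Histogram.ratio.String': 0.4}
```

Running this on the Age column yields the same counts and ratios shown in the example output.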